In previous sessions we have seen how to handle unaligned sequence data as well as where and how to retrieve genome reference data in Fasta format.
.pull-left[]
.pull-right[]
We now need a suitable aligner software to place unaligned reads onto our reference genome to produce a SAM/BAM file.
Aligner softwares can be broadly placed into two categories.
Popular genomic aligners include Bowtie, bwa, subread,GMAP.
Popular splice aware aligners include Tophat, SpliceMap, subjunc, GSNAP, STAR.
A splice aware aligner is important for analysis of RNAseq where mRNA’s introns are spliced to stitch exons into continuous sequence.
A few of the popular aligners are wrapped up in R/Bioconductor packages allowing us to use our aligner software from R as well as make use of some of the R/Bioconductor objects we are growing to love.
The rsubread and gampR packages offer convenient access to subread/subjunc and gmap/gsnap on Mac and Linux but sadly are not implemented on Windows.
The QuasR package offers an interface to Bowtie and SpliceMap on Windows, Mac and Linux and so provides access to a genomic and splice aware aligner on all systems.
The Rbowtie2 package offers a wrapper to the popular Bowtie2 software and offers significant improvement in memory and CPU usage to Bowtie packaged in QuasR.
In this session we will focus on the QuasR package for all users but recommend rsubread for analysis.
In this session we will be making use of some public datasets from the Encode consortium.
We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.
This RNAseq data has been generated from the human cell line GM12878 and the link to experiment can be found here or a direct link to FastQ for replicate 2 we are using can be found here.
To speed up the processing for this practical I have retrieved aligned data from Encode for the sample ENCSR297UBP and extracted reads mapping to ActB gene on hg20 human genome build. I have further downsampled these reads to only 10000 reads out of the 200000 mapping to this gene.
This sampled data can be found in Data/sampledActin.fq.gz
We will also require some reference data so lets install the BSgenome package for hg38
source("https://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg38")
library(BSgenome.Hsapiens.UCSC.hg38)The sample table requires is a tab-delimited file listing the path to fastq(s) to be aligned and the desired sample names.
FileName <- "Data/sampledActin.fq.gz"
SampleName <- "sampledActin"
sampleTable <- data.frame(FileName,SampleName)
write.table(sampleTable,file="sampleTable.txt",sep="\t",quote=FALSE,row.names = FALSE)## FileName SampleName
## 1 Data/sampledActin.fq.gz sampledActin
Internally, QuasR will create a FASTA file from our BSgenome object prior to alignment.
We can also provide a FASTA file directly to the qAlign() function.
First lets extract a FASTA file from out BSgenome object. Here we will create a FASTA file for just Chr7 (the location of ActB)
library(BSgenome.Hsapiens.UCSC.hg38)
chr7hg38 <- BSgenome.Hsapiens.UCSC.hg38[["chr7"]]
chr7hg38Set <- DNAStringSet(list(chr7=chr7hg38))
writeXStringSet(chr7hg38Set,file="chr7.fa")The qAlign() function will allow us to retrieve the aligned data as BAM.
Our generated BAM file will be in same directory as our fastq file. By default QuasR will add a random string to ensure we do not write over previous files.
Data/sampledActin_29a3a870bf.bam
The samtools software provide command line tools to handle SAM and BAM files such as indexing and sorting.
The Rsamtools package allows us to make us of the samtools functions within R.
First we can install and load the library.
source("https://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
library(Rsamtools)After sorting, we can now index our sorted BAM file using the indexBAM() function.
indexBam("SortedActB.bam")## SortedActB.bam
## "SortedActB.bam.bai"
We can now review our BAM file in IGV.
We are missing reads which would align across splice junctions. To use the qAlign() function to align spliced reads we must simply specify the parameter splicedAlignment to be TRUE.
qAlign("sampleTable.txt","chr7.fa",splicedAlignment = TRUE)We can get an overview of BAM file information using the quickBamFlagSummary() function.
quickBamFlagSummary("SortedActBSpliced.bam")
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 10000 | 10000 | 1 / 1
## o template has single segment.... S | 10000 | 10000 | 1 / 1
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 9037 | 9037 | 1 / 1
## - primary alignment......... S2 | 9037 | 9037 | 1 / 1
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 963 | 963 | 1 / 1Now if we compare to the original alignment from Encode we can identify where some of our unaligned reads may have gone.
The Rbowtie2 library offers a faster alternative to genomic alignment in R using Bowtie2.
library(Rbowtie2)Now we have created an index to align to, we can align our fastq data to this index.
First however we will need to decompress our compressed fastq files to use them with Rbowtie2. The gunzip function in R allows us to decompress file from R.
library(R.utils)
gunzip("Data/sampledActin.fq.gz")The bowtie2 function outputs a SAM file. We will want to produce a BAM file we can sort and index from this SAM file using the asBam function.
bamFile_Bowtie2 <- asBam("sampledActin.sam")
bamFile_Bowtie2Although Rbowtie2 is a significant improvement over QuasR it only allows for genomic alignment and not splice aware alignment.
This means that Rbowtie2 is suitable for - ATAC-seq - ChIP-seq - WGS
Rbowtie2 is NOT suitable for - RNA-seq - Ribo-seq
The Rsubread package requires we first build an index of our reference genome for use during alignment.
Here we use the buildindex() function, specifying the name of index to be built and the FASTA file to build the index from.
buildindex("chr7","chr7.fa")We can now use the subjunc() function to align reads in a splice aware manner.
We must again specify out index name, fastq file, desired output format and name of BAM file to be created.
subjunc("chr7","Data/sampledActin.fq.gz",
output_format = "BAM",
output_file = "Data//RsubreadsampledActin.bam")Finally we can assess mapping rate.
quickBamFlagSummary("Sorted_RsubreadsampledActin.bam")## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 10000 | 10000 | 1 / 1
## o template has single segment.... S | 10000 | 10000 | 1 / 1
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 9726 | 9726 | 1 / 1
## - primary alignment......... S2 | 9726 | 9726 | 1 / 1
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 274 | 274 | 1 / 1